ABSTRACT

Discovery of biological relationships between genes is one
of the keys to understanding the complex functional nature
of the human genome. Currently, most of the knowledge
about interrelated genes is found in the immense and
varied biomedical literature. Hence, extraction of biological
contexts occurring in free text represents a valuable tool for
gaining knowledge about gene interactions. We present a
textual analysis of documents associated with pairs of genes,
and describe how this approach can be used to discover and
annotate functional relationships among genes. A study on
a subset of human genes shows that our analysis tool can
act as a ranking mechanism for sets of genes based on their
functional relatedness.

Keywords 

information  retrieval,  document  clustering,  gene  relations 

1.  INTRODUCTION 

Although most genes in human DNA have now been
completely sequenced [3], their functional roles and the diverse
interrelationships between them are still to be fully
understood. With the development of the DNA microarray [10],
researchers have a tool with which they can measure the
expression levels of several genes at a time. Such experiments
produce huge amounts of data, and the resulting discoveries
are published at an enormous rate in the scientific literature,
giving researchers a severe information retrieval
challenge in keeping up to date in their fields of expertise.
With the aim of structuring existing knowledge occurring in
free text, biomedical text collections have been subject to
extensive research in recent years.

The problem of extracting information about how genes are
related has been the major focus of many groups, e.g. [5,
8, 12, 16, 17, 18], and has led to a variety of approaches for
discovering functional groupings among genes. Clearly, any
successful method should be able to extract the biological
nature of the discovered relationships. This goal has been
achieved to some extent by different efforts, but they either
rely on the quality of documents associated with genes [12]
or limit themselves to controlled vocabularies for annotating
the relationships [5, 16]. Furthermore, knowledge about
gene relations often includes several biomedical aspects, i.e.
biology, chemistry and medicine. This fact reflects the complex
nature of gene relationships, and indicates that they
ought to be characterized by more than one functional
context.

We propose an approach that initially extracts the multiple
local contexts between pairs of genes found co-occurring
in MEDLINE abstracts. A global analysis of the local
contexts between pairs is then performed, giving similar local
contexts a global interpretation. It is our belief that this
scheme can represent an efficient way of discovering
functionally related genes.

We evaluate our method on a subset of human genes, and
the results (though preliminary) show that sets of genes
connected by the same global contexts are functionally similar.

The rest of the paper is organized as follows: The next
section presents related work on mining the literature for gene
relations. We then give a description of the models and
methods used in our scheme for finding functionally related
genes. Finally, we present and discuss preliminary results
from applying our approach to a set of human genes.


2.  RELATED  WORK 

Detecting gene relations based on the co-occurrence methodology
was initially explored by Stapley et al. [14] in their
prototype system for visualization of gene interactions. Later,
the method was utilized in a comprehensive manner by Jenssen
et al. [5, 6], who developed a genome-wide network of human
genes. The co-occurrence method is very efficient for
its purpose: to detect gene relations. However, co-occurrence
alone cannot help us discover the characteristics of the
relation. An approach that goes beyond simple co-occurrence
was suggested by Jenssen et al. [5], who annotated the relations
between genes detected by co-occurrence with associated MeSH and
GO terms.

Recently, analysis of the graph structure inherent in a
co-occurrence network has attracted the attention of researchers,
e.g. [17, 18]. Wilkinson et al. [17] employed Girvan and
Newman's process of finding communities [4] to discover related
genes. By picking sets of genes statistically correlated to
user-selected keywords, components of a gene co-occurrence
graph are partitioned into functionally related communities.
Interesting results included placing co-occurring genes
into different communities, demonstrating that
co-occurrence does not always imply functional relatedness.
Wren et al. [18] took advantage of statistical properties
of connections in the network to determine the "cohesiveness"
of sets of co-occurring objects (genes, diseases, chemical
compounds etc.). The technique could therefore identify
whether a set of objects forms a purposeful grouping and,
perhaps more importantly, whether members not in the set
should be included.

Based on domain knowledge from thesauri, Stephens et
al. [16] both found and annotated gene relationships by scanning
sentences for gene thesaurus terms. However, the approach
depends on high-quality domain-specific thesauri
in order to produce good results.

Given a group of genes, Raychaudhuri et al. [8] developed
the concept of neighbor divergence per gene (NDPG) within
scientific texts to discover a potential biological relation in
the group. The motivation behind their approach was to
recognize articles describing the function inherent in the
group. It achieved accurate results on a test set taken from
the yeast organism (79% recall at 100% precision). However,
the method requires that a list of relevant articles be
provided for each gene in the organism, and this requirement
is by no means trivial. Furthermore, NDPG does not tell us
the function shared by a set of genes; it merely determines whether
the group shares one.

Approaches using the published literature as the main source
for annotation have been investigated earlier. With the
same goal as [5] of establishing functional gene relations on a
genome-wide scale, Shatkay et al. [12] employed document
similarity search as the basis for their method. Arguing that
clustering of co-expressed genes from DNA microarray experiments
may fail to give the true picture of interrelationships
between genes, they proposed a complementary method in
which relationships between genes are found and annotated
by measuring the similarity between the genes' sets of relevant
documents in the literature. The annotation mechanism
involves a "theme-based" probabilistic search [13],
which provides a summary of the content shared between a query
document and its similar documents. The main limitation
of this approach is that it requires each gene to be associated
with a kernel document capturing most of the gene's
functional biology. The method relies heavily on the quality
of these documents, which may be hard to find.

3.  METHODS 

In this section, the methods and models used in our approach
are described in more detail.

3.1  Overview 

Our work represents a novel method for annotating the functional
contexts that exist between genes found co-occurring
in MEDLINE records. After creating a co-occurrence graph
of human genes from MEDLINE, contexts between genes are
assigned by local and global analysis of documents associated
with the edges of the graph. The documents associated with
an edge of the graph are the MEDLINE abstracts in which a
pair of genes co-occurred. First, documents relating to each
gene pair are clustered into k local clusters, thus splitting the
literature related to a pair into k contexts. Each cluster
(context) between a gene pair is then associated with
its hundred most descriptive features. Viewed within the
context of the co-occurrence graph, this operation splits each
edge into a multiedge, reflecting multiple relationships
between the connected nodes. In our terminology,
the co-occurrence graph has been unfolded.

With the goal of creating a limited set of contexts between
the genes in our unfolded graph, we give each edge in the
graph a globally defined context, or "color". The colors are
defined on the basis of the total set of local contexts occurring
between the genes. More specifically, we cluster the
total set of descriptive features into a predefined number of
clusters. As in the first stage, each cluster (color) is associated
with its most descriptive features. This second stage
ensures that similar local functional contexts occurring between
any pair of genes are given the same global context.

Having a co-occurrence graph between genes as the only
prerequisite, our approach to mining gene relations can thus
achieve two major goals:

•  annotate multiple relationships between pairs of genes
with globally defined functional contexts

•  find functionally related groups of genes by
extracting same-colored edges in the colored unfolded
co-occurrence graph

3.2  Creation  of  co-occurrence  graph 

A co-occurrence network between human genes forms the
backbone of our method. As shown in various experiments [5,
6], the co-occurrence method has proved to be an efficient
as well as valid approach to detecting meaningful biological
relationships between genes. The methodology is simple:
if two genes co-occur in an abstract, they are assumed
to have a relationship of some kind. Our work is not an
attempt to copy the comprehensive network developed by
the people behind PubGene [5]; hence, we do not intend
to improve upon the method for co-occurrence extraction.


In fact, we only used HGNC¹, the HUGO Gene Nomenclature
Committee, as the database of gene symbols used in our
search for co-occurrences. That said, the nomenclature provided
by HGNC does include literature aliases for a major
part of the symbols, and these were also searched for.
Common abbreviations used in the biology literature (e.g. IV,
SD, ABO etc.) that coincided with gene symbols led to false
positives, as experienced by [5, 17]. The actual extraction
process was done in a straightforward manner: whenever a
symbol was found in a MEDLINE record (title or abstract),
this was considered a match for the gene associated with
the symbol. A link was made between a pair of genes if they
occurred in the same record, and the strength of the link
was found by counting the number of records in which the
pair co-occurred.

There is, however, a key difference between our extraction
process and the one by Jenssen et al. [5]. Along with creating
the co-occurrence-based links, the set of MEDLINE
records (hereafter termed documents) associated with each
pair of genes was stored for further analysis.
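The extraction step described above can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions: the function names, the whitespace-and-punctuation tokenizer, and the exact, case-sensitive symbol match are choices made here and are not taken from the original implementation.

```python
import re
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(records, symbol_to_gene):
    """Map each co-occurring gene pair to the set of records supporting it.

    records:        dict of record id -> text (title plus abstract)
    symbol_to_gene: dict of symbol or alias -> official gene symbol
    (Both structures are hypothetical stand-ins for MEDLINE and HGNC data.)
    """
    edges = defaultdict(set)
    for record_id, text in records.items():
        # Assumed tokenizer: runs of letters, digits and hyphens.
        tokens = set(re.findall(r"[A-Za-z0-9-]+", text))
        genes = sorted({symbol_to_gene[t] for t in tokens if t in symbol_to_gene})
        for pair in combinations(genes, 2):
            edges[pair].add(record_id)  # keep the documents, not just a count
    return edges


def link_strength(edges, pair):
    """Strength of a link = number of records in which the pair co-occurs."""
    return len(edges.get(pair, ()))
```

Storing the record identifiers on each edge, rather than only a count, is what makes the later per-edge document clustering possible.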

We model the documents in our collection using the document
vector model [2]. This model considers a document as
a set of representative keywords, or index terms. Index terms
are document words (mainly nouns) used to summarize the
semantic contents of the text. In order to reduce the influence
of very common words, the terms are weighted with
the TF-IDF (Term Frequency-Inverse Document Frequency)
strategy. If M denotes the number of distinct index terms
in our collection of N documents, each document i will be
represented by a vector in the following format:

\[ d_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,M}) \]

Each weight $w_{i,j}$ is given by TF x IDF:

\[ w_{i,j} = tf_{i,j} \times \log \frac{N}{n_j} \]

where $tf_{i,j}$ is the normalized frequency of term j in document
i. The IDF factor is calculated as $\log \frac{N}{n_j}$, where $n_j$
denotes the number of documents in which term j occurs.
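As an illustration of the weighting scheme, the following Python sketch computes sparse TF-IDF vectors for tokenized documents. The text does not say how the term frequency is normalized, so dividing by the count of the most frequent term in the document is an assumption made here; the function name is likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse vector (dict of term -> weight) per document.

    docs: list of token lists (already stemmed and stop-word filtered).
    """
    n_docs = len(docs)
    # n_j: number of documents in which term j occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        vectors.append({
            # Assumed normalization: raw count divided by the maximum count.
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document receives weight zero, which is the intended damping of very common words.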


edge, but we are investigating more advanced ways of deciding
k (see Section 5). The clustering technique employed
is bisecting K-means. With the bisecting K-means approach,
a document collection is first clustered into two groups, then
one of these groups is selected and bisected further. The
similarity function used for the clustering is the cosine
coefficient, given in Equation 1. A detailed explanation of the
bisecting K-means clustering technique can be found elsewhere,
e.g. [15]. The most descriptive features of a cluster
are found by selecting the l words that contribute the most to
the average similarity between the documents in the cluster.
Currently, l=100 is used as the number of descriptive
features.
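One way to read "the words that contribute the most to the average similarity" is sketched below. It assumes that documents are scaled to unit length and that the average is taken over all document pairs including self-pairs; under those assumptions the average cosine similarity equals the squared length of the cluster centroid, so each term contributes the square of its centroid weight. This is an interpretation, not a description of CLUTO's internals, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def descriptive_features(vectors, top_l=100):
    """Rank terms by their share of the cluster's average pairwise similarity.

    vectors: list of sparse TF-IDF vectors (dict of term -> weight).
    """
    centroid = defaultdict(float)
    for vec in vectors:
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm == 0:
            continue  # an empty document contributes nothing
        for term, weight in vec.items():
            centroid[term] += weight / (norm * len(vectors))
    # Contribution of a term = squared centroid component; ties broken by name.
    ranked = sorted(centroid, key=lambda t: (-centroid[t] ** 2, t))
    return ranked[:top_l]
```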

Although we now cluster each edge into k=2 clusters, an
extra step is taken to ensure that the edge clusters have
a certain degree of dissimilarity. If the majority of the two
clusters' descriptive features are identical, the edge is not
split into two clusters. This case reflects the fact that
all the literature discussing the pair of genes is basically
referring to the same context. In order to retrieve such an
edge's descriptive features, we treat all its documents as
belonging to a single cluster.
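The dissimilarity check can be expressed as a simple overlap test. In this sketch, "majority" is read as more than half of the smaller feature list being shared; that threshold and the function name are assumptions, since the text does not give an exact criterion.

```python
def should_merge(features_a, features_b, threshold=0.5):
    """Return True if two clusters on an edge describe the same context.

    features_a, features_b: descriptive features (stemmed words) of the
    two local clusters. The 0.5 threshold is an assumed reading of
    "the majority of the descriptive features are identical".
    """
    set_a, set_b = set(features_a), set(features_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return False
    return len(set_a & set_b) / smaller > threshold
```

When the test returns True, the edge would keep a single cluster and its descriptive features would be recomputed over all of its documents.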

3.4  Coloring  the  unfolded  graph 

The result of the first clustering stage is a graph with multiple
edges between nodes, in which each edge is associated
with ten descriptive features. To assign each edge a globally
defined color, we cluster the total set of descriptive features
in the graph into m clusters (colors). The variable m
reflects how many functional contexts we expect to see on a
global basis in the graph, and is a factor we are currently
experimenting with (see Section 5 for further discussion).
As in the first stage, we employ bisecting K-means as the
clustering technique. Furthermore, each of the m colors is
given descriptive features following the same procedure as
in Section 3.3. A color's ten most descriptive features provide
a brief summary of a global functional context. Since
each edge in the unfolded co-occurrence graph now belongs
to a particular color and its associated features, the graph has
been colored.


Similarity between two documents is found by seeing how
well their two respective vectors correlate, quantified by the
cosine of the angle between them:


\[ \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \times \|d_j\|} \qquad (1) \]


Because all term weights are non-negative, the cosine coefficient
will range from 0 to 1, where 1 denotes complete similarity
(di = dj), and 0 implies orthogonal vectors.
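Equation 1 translates directly into code for sparse vectors. The sketch below assumes documents are stored as dictionaries from term to weight; returning 0 for an empty vector is a convention chosen here, not something specified in the text.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two sparse vectors (dict of term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)
```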


3.3  Unfolding  the  co-occurrence  graph  with 
document  clustering 

Along each edge in the co-occurrence graph, we use a clustering
software toolkit named CLUTO² to cluster the documents
into k clusters. At the moment, we use k=2 on every

¹ http://www.gene.ucl.ac.uk/nomenclature/

² http://www.cs.umn.edu/karypis/cluto/


Given a clique of nodes in the colored unfolded co-occurrence
graph, we developed a simple measure of "color purity":
the maximum number of edges in the clique connected by
the same color. Since the coloring process can give a gene
pair two global contexts of the same color, two same-colored
edges between a gene pair in a clique were merged into one
edge. In that manner, all the k(k-1)/2 gene pairs in a clique
of size k were connected either by two edges of different
colors or by one edge alone. A formal expression of the
purity measure can then be given:

\[ \mathrm{maxColorFrac}_c = \frac{2 \cdot \max_{color} |EDGES_c^{color}|}{k(k-1)} \]

where $EDGES_c$ represents the total set of edges in the unfolded
colored clique c of size k, and $EDGES_c^{color}$ denotes the
subset of those edges that carry the given color.
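The purity measure can be sketched as follows. The edge representation (a triple of two gene names and a color label) and the function name are assumptions made for illustration; the merging of same-colored duplicate edges follows the description above.

```python
from collections import defaultdict


def max_color_frac(clique_size, colored_edges):
    """Largest fraction of gene pairs in a clique joined by one color.

    colored_edges: iterable of (gene_a, gene_b, color) triples from the
    colored unfolded graph, restricted to the clique.
    """
    pairs_by_color = defaultdict(set)
    for gene_a, gene_b, color in colored_edges:
        # Using a set merges two same-colored edges between the same pair.
        pairs_by_color[color].add(frozenset((gene_a, gene_b)))
    n_pairs = clique_size * (clique_size - 1) // 2
    if n_pairs == 0 or not pairs_by_color:
        return 0.0
    return max(len(pairs) for pairs in pairs_by_color.values()) / n_pairs
```

A value of 1.0 means that a single color connects every gene pair in the clique.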

3.5  GO-similarity 

We use the Gene Ontology (GO)³ as a means of validating
our method. Being the most comprehensive ontology used
to describe the functional roles of genes, it is a valuable tool
for assessing whether two genes are biologically related. The
terms comprising GO are organized into a directed acyclic
graph (DAG), which has the property of multiple inheritance.
Every GO term follows the true path rule: if
a child term describes a gene product, then all its parents
also apply to that gene product. Using EBI's⁴ existing GO
annotation of the human genome, we managed to associate
10030 HUGO gene symbols with GO terms. In an effort to
expand the number of GO terms per gene, we took advantage
of the true path rule inherent in the ontology graph
structure to generate larger sets of GO terms per gene.

³ http://www.geneontology.org
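Expanding a gene's annotations with the true path rule amounts to collecting all ancestors of its directly annotated terms in the DAG. The sketch below assumes the ontology is available as a mapping from each term to its parent terms; the function name and that data layout are hypothetical.

```python
def expand_with_ancestors(terms, parents):
    """Return the given GO terms together with all of their ancestors.

    terms:   iterable of directly annotated term ids
    parents: dict of term id -> iterable of parent term ids (a DAG, so a
             term may have several parents)
    """
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue  # already reached through another path
        expanded.add(term)
        stack.extend(parents.get(term, ()))
    return expanded
```

Tracking visited terms matters here because multiple inheritance means the same ancestor can be reached along several paths.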

One way of measuring gene functional similarity would be
to find which GO terms are common to the genes in
question. While this approach is simple and intuitive,
it does not give us any quantitative measure of similarity.
Alternatively, one can consider each gene as a "document",
where the document consists of textual descriptions of the GO
terms associated with it. Furthermore, we can model each
document in the vector space of GO terms, and as shown
earlier, this view gives us an opportunity to compute
quantitative similarities. Now, the index terms consist of all GO
terms associated with gene symbols. If there are a total of
N GO terms used in the annotation of our genes, we can represent
a gene with the following GO vector, where $w_{i,j}$ is the
weight of GO term j for gene i:

\[ g_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N}) \]


4.  RESULTS 

Our initial co-occurrence graph contained 5799 human genes
connected by 73729 edges, each associated with two or more
documents. In order to make sure that gene pairs were
represented appropriately in the literature, we pruned the graph
to only include edges with between 10 and 100 documents.
Furthermore, we kept only genes that were annotated with
GO terms, reducing the graph even more. Finally, after
removing genes that were considered to be false positives
because of bad aliases, our test graph contained 1516 genes,
connected by 4849 edges.
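The first two pruning steps can be sketched as a single filter over the edge map. The function name, the inclusive bounds and the data layout (gene pair mapped to its set of documents) are assumptions made for illustration; the manual removal of genes with bad aliases is not modeled.

```python
def prune_edges(edges, annotated_genes, min_docs=10, max_docs=100):
    """Keep edges with an acceptable document count between GO-annotated genes.

    edges:           dict of (gene_a, gene_b) -> set of document ids
    annotated_genes: set of genes that have at least one GO term
    """
    return {
        pair: docs
        for pair, docs in edges.items()
        if min_docs <= len(docs) <= max_docs
        and all(gene in annotated_genes for gene in pair)
    }
```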

Figure 1 shows a part of the unfolded co-occurrence graph,
with each edge labeled by its descriptive features. Note
that one connection in this part of the graph contains only
one cluster; this corresponds to the case where the documents
are considered to contain only one functional context.
However, this is a rare case among gene pairs
in the graph, since almost every pair was given two local
contexts during the first clustering process.

After coloring the clusters of edges with the method outlined
in Section 3.4 (using m=100 colors), the clique shown in
Figure 1 turned into the clique shown in Figure 2. As can
be seen from Figure 2, some of the color descriptions are
fairly similar, and this observation may imply that a color
scheme with a lower number of colors should have been used
(see Section 5 for further discussion).


The weighting strategy is slightly different from that used in
the ordinary text document context. Since no GO term is
associated with a gene more than once, the TF (term frequency)
factor is omitted. Thus, each $w_{i,j}$ will only contain
the IDF part, reflecting how relevant or specific GO term j
is to gene i. Finally, each gene's weighted GO vector $g_i$ is
normalized to a vector of length 1. Using the cosine coefficient
used previously for abstracts, we can define our notion
of GO similarity.


Definition 1. Given two genes i and j, represented by
their weighted normalized GO vectors $g_i$ and $g_j$, their
GO-similarity score is given by:

\[ \mathrm{GOsim}(g_i, g_j) = g_i \cdot g_j \]


Definition 2. Given a set of n genes, represented by their
weighted normalized GO vectors ($g_1, \ldots, g_n$), the average
pairwise GO-similarity in the set is given by the standard
sum-of-pairs score:

\[ \mathrm{avgGOsim}(g_1, \ldots, g_n) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{i>j} \mathrm{GOsim}(g_i, g_j) \qquad (2) \]
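Definitions 1 and 2 can be illustrated with the short sketch below. The text does not state over which collection the IDF of a GO term is computed; here it is assumed to be the logarithm of the number of genes divided by the number of genes annotated with the term. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def go_vectors(gene_terms):
    """Build unit-length, IDF-weighted GO vectors.

    gene_terms: dict of gene -> set of GO term ids (after ancestor expansion).
    """
    n_genes = len(gene_terms)
    term_freq = Counter(t for terms in gene_terms.values() for t in set(terms))
    vectors = {}
    for gene, terms in gene_terms.items():
        # No TF factor: a term is attached to a gene at most once.
        vec = {t: math.log(n_genes / term_freq[t]) for t in set(terms)}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[gene] = {t: w / norm for t, w in vec.items()} if norm > 0 else {}
    return vectors


def go_sim(u, v):
    """Definition 1: dot product of two normalized GO vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def avg_go_sim(vectors):
    """Definition 2: average of GOsim over all unordered gene pairs."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(go_sim(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
```

Dividing the pair sum by the number of pairs is the same as multiplying it by 2/(n(n-1)), as in Equation 2.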


We believe this definition of GO-similarity is valid for our
limited purposes. Lord et al. [7] used a similar line of attack
when they explored semantic similarities across GO using
Resnik's [9] notion of shared information content. Their
method was validated by showing that semantic GO similarity
correlated well with sequence similarity in the SWISS-PROT
database.

⁴ ftp://ftp.ebi.ac.uk/pub/databases/GO/goa


To validate our method on a large scale, we looked at all the
cliques in the colored unfolded co-occurrence graph. Cliques
are fully connected subgraphs of the graph, and based
on the co-occurrence assumption, a clique can potentially
contain a set of functionally related genes. By measuring
the color distribution among edges in cliques of size 4 and
greater, we investigated whether this distribution was related
to functional similarity among the genes in the clique.
More specifically, the color purity measure developed in
Section 3.4 was used to rank sets of genes. This
was accomplished by sorting all the cliques in the graph
in decreasing order of maxColorFrac, and plotting the
running average GO-similarity of cliques in this ordering.
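The ranking and running average can be sketched as follows, assuming each clique has already been reduced to a pair of numbers: its ranking score (for example maxColorFrac) and its average GO-similarity. The function name and input layout are illustrative only.

```python
def running_average_by_rank(cliques):
    """Running average GO-similarity over cliques sorted by ranking score.

    cliques: list of (score, avg_go_similarity) tuples, one per clique.
    Returns a list whose r-th entry is the mean GO-similarity of the
    r best-ranked cliques.
    """
    ranked = sorted(cliques, key=lambda clique: clique[0], reverse=True)
    averages = []
    total = 0.0
    for position, (_, go_similarity) in enumerate(ranked, start=1):
        total += go_similarity
        averages.append(total / position)
    return averages
```

A ranking that places functionally related cliques first produces a curve that starts high and decays toward the overall mean.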


To  evaluate  the  quality  of  our  approach,  we  compared  our 
color-based  ranking  with  three  schemes  that  only  employ 
local  contexts  to  evaluate  a  group  of  genes’  relatedness.  The 
first  scheme  computes  average  pairwise  document  similarity 
between  documents  supporting  each  gene-pair  in  a  clique  of 
genes: 


\[ SIM_A = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{i>j}^{n} \frac{|D_i \cap D_j|}{|D_i \cup D_j|} \]


where  Di  is  the  set  of  documents  supporting  edge  i,  and  n 
represents  the  number  of  edges  in  the  clique.  The  second  one 
measures  average  pairwise  textual  similarity  between  docu¬ 
ments  supporting  each  gene-pair  in  a  clique  of  genes: 


\[ SIM_B = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{i>j} \cos(md_i, md_j), \]


where n denotes the number of edges in the clique, and $md_i$
represents edge i's metadocument, a combined document of
the documents supporting edge i. The last scheme computes


CRH-ADCYAP1
  0: gh releas srih somatotrop pacap38 corticotrop hormon turbot teleost pituitari
  1: pacap pituitari peptid cyclase-activ polypeptid adenyl hormon effect placenta rat
CRH-GHRH
  0: gh hormon releas sleep srih hypothalam cortisol pituitari microgram secret
CRH-NPY
  0: leptin food hypothalam intak neuropeptid energi feed nucleu obes orexigen
  1: neuropeptid releas dsip gh cortisol suicid hormon plasma antidepress acth
GHRH-ADCYAP1
  0: gh pacap releas srih somatotrop secret pituitari miapaca-2 antagonist hormon
  1: pacap peptid pituitari fish cdna polypeptid goldfish adenyl gh receptor
GHRH-NPY
  0: gh neuron srih ghs-r releas hypothalam hormon arcuat pituitari mma
  1: iugr intak food punish growth ep peptid retard fr unpunish
NPY-ADCYAP1
  0: pacap nerv fiber fibr peptid neuropeptid pacap-27 pacap-38 ganglia sp
  1: pacap pacap38 neuron peptid pituitari gpalpha releas adenyl neuropeptid sympathet


Figure 1: A clique of size 4 in the unfolded co-occurrence graph showing genes NPY, GHRH, ADCYAP1
and CRH. Each edge's documents have been clustered into k=2 or k=1 clusters. Also shown are the most
descriptive features (stemmed) of each cluster in the clique.


Color  67:  pituitari  peptid  gnrh  srih  prolactin  somatotrop  relea  hormon  secret  gastrin 
Color  75:  cow  supplement  oil  rumin  steer  forag  diet  intak  digest  fed 


Figure 2: The same clique as in Figure 1, where each cluster has been replaced with a color label, the global
functional context. The descriptive features (stemmed) of each color are also given.

Figure 3: Running average GO-similarity in four different rankings of cliques. The rankings are based on four
different methods for determining functional relatedness of the genes in a clique: maxColorFrac = Maximum
fraction of clique-edges covered by the same color, SIM_A = Average pairwise document similarity among
the edges in the clique, SIM_B = Average pairwise textual similarity between documents supporting
the edges in the clique, and SIM_C = Textual similarity of the union of documents among edges in a clique.


Clique size    Multicolor fraction
4              56.5%
5              66.7%
6              79.9%
7              85.5%

Table 1: Average fraction of multicolored edges in
cliques.

the average textual similarity of the union of documents
supporting each gene pair in a clique of genes:

\[ SIM_C = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{i>j}^{k} \cos(d_i, d_j) \]

where k is the number of unique documents in the clique,
and $d_i$ represents document i in this unique set.
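As an illustration of the simplest of the three baselines, the set-overlap score SIM_A can be sketched as below. Treating the score of two empty document sets as 0 is a convention assumed here, and the function name is hypothetical.

```python
from itertools import combinations


def sim_a(edge_documents):
    """Average pairwise overlap between the document sets of a clique's edges.

    edge_documents: list of sets of document ids, one set per edge.
    """
    def overlap(docs_a, docs_b):
        union = docs_a | docs_b
        return len(docs_a & docs_b) / len(union) if union else 0.0

    pairs = list(combinations(edge_documents, 2))
    if not pairs:
        return 0.0
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)
```

SIM_B and SIM_C follow the same sum-of-pairs pattern, with the set overlap replaced by the cosine coefficient between metadocuments or between individual documents.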

As can be seen from Figure 3, among the top 5% of cliques as
ranked by each method, our scheme discovers more
functionally related sets of genes than the other methods.
However, on the remaining cliques, the performance of our
scheme does not persist in the same manner, and the reason
for this is currently being investigated.

To give an indication of how multicolored our cliques are,
Table 1 shows the average fraction of multicolored edges in
cliques of different sizes. Although nearly
every gene pair in our unfolded co-occurrence graph was
assigned two edges (local contexts), the global coloring has
made sure that similar local contexts are given the same global
context; these account for the same-colored fraction of edges. So,
even though the majority of genes in cliques are connected
by different global contexts, our approach can still find the
cliques with the most functionally related genes.

5.  DISCUSSION  AND  FURTHER  WORK 

There are several limitations to our approach, and it is currently
being explored in different ways. Document clustering
represents a high-level method for the problem of finding
functional contexts between genes, as it does not involve any
form of advanced natural language processing (NLP). Thus, results
should give perspective rather than detailed knowledge. The descriptive
features associated with the global contexts exemplify this
in not being very detailed.

The number of local contexts that are likely to exist between
a pair of genes will depend on how much research
and published literature there is about the pair, and this
varies widely for different pairs. Providing each edge with
an estimate of k, the number of local contexts likely to exist
between the connected genes, will give the clustering process
a higher degree of validity. Intuitively, the number of
MEDLINE records for a pair will give some indication
of k. Using the MeSH⁵ terms associated with each MEDLINE
article may also be of importance. Sehgal et al. [11]
recently developed MeSH profiles of topics in MEDLINE
collections. Developing a MeSH profile for a gene pair can
act as a heuristic leading to the right size of k.

⁵ Every MEDLINE record is associated with
Medical Subject Headings (MeSH) terms. See
http://www.nlm.nih.gov/mesh/meshhome.html for more
information.

We will also work on methods for determining the appropriate
number of global functional contexts (colors) for a given
set of gene pairs. At the moment, we experiment with different
color schemes, and evaluate a scheme's goodness based on
empirical observations of the specificity of the different colors'
descriptive features. A more theoretical procedure for
this assessment would be beneficial for the method's applicability.
Factors such as graph size and functional diversity
among the genes in the graph will play a significant role in
determining the right size of m.

Our results have shown that groups of highly "GO-similar"
genes are connected with similar global functional contexts.
However, GO-similarity may not give the whole picture
of a set of genes' relatedness with respect to MEDLINE
records. Since gene relations are discussed in the literature
in a variety of contexts, the functional contexts
assigned to a pair of genes will represent a broad notion of
biomedical knowledge. GO terms, on the other hand, are
specific and relate only to genes' biological processes,
molecular functions and cellular components. Hence, some
cliques may appear with high maxColorFrac (implying
context-related genes) even though their GO-similarity is low.

The mechanism for selecting potential sets of related genes
in the graph will influence the functional discoveries among
the genes. Although the results obtained using cliques are promising,
tracking same-colored connected components might
give other interesting findings. Moreover, our current measure
of the functional relatedness of genes in a clique,
maxColorFrac, may be too simple to capture the properties
of the set of genes. A closer investigation of the color
distribution in the clique might reveal other functional
relations.

As noted earlier, the co-occurrence process has not been our
area of focus; thus, our initial graph possibly included
more false positives than desirable. Badly designed gene
symbols, coinciding with other abbreviations in the literature,
are a matter of great frustration among text miners in
biology. Recently, an approach to address and resolve such
symbol ambiguities was proposed by Adar [1],